8  Data Transformation:

Learning Objectives

After completing this lab you should be able to

For each of our modules we will have a project-folder with an Rproject, *.qmd-files, and sub-directories for data, scripts, and results as described in our Rproject Tutorial. You should have a directory on your Desktop or Documents folder on your laptop (name it something like bi349) as a home directory for all of our project folders this semester.

You should have already downloaded the project directory for this module, make sure the directory is unzipped and move it to your bi328 directory. You can open the Rproj for this module either by double clicking on it which will launch Rstudio or by opening Rstudio and then using File > Open Project or by clicking on the Rproject icon in the top right of your program window and selecting Open Project.

Once you have opened a project you should see the project name in the top right corner1.

  • 1 Pro tip: If you run into issues where a quarto document won’t render or file paths aren’t working (especially if things were working previously) one of your first steps should be to double check that the correct Rproj is loaded.

  • There should be a document names 08_data-transformation-ii.qmd in your project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing in the YAML header, change the author to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set2 you may of course add as many comments as you need to be able to recall what you did]. Similarly, take notes in the document as we discuss discussion/reflection questions but make sure that you go back and clean them up for “public consumption”.

  • 2 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.

  • Let’s start by loading the packages we will need for this activity.

    # load libraries
    library(knitr)
    library(tidyverse)

    And we will want to make sure that we have read in our data set as a dataframe.

    # read catch data
    catch <- read_delim("data/longline_catchdata.txt", delim = "\t")

    8.1 Adding new variables

    So,turns out selecting columns and filtering based on content in rows is pretty straightforward.

    But frequently when we are processing our raw data sets we end up wanting to compute additional metrics or use the existing raw data to create new categories.

    The function mutate() can be used to create new columns. Frequently, this is done based on columns already existing in the data frame. This is a very powerful function with endless possibilities, but we are going to stick to some of the basics for now3.

  • 3 Rest assured if your answer is “Oh, could I …” the answer is “Yes”.

  • Let’s say you wanted create a column that contained the difference between the fork length and the stretch total length4:

  • 4 By default mutate() appends (adds) the new column as the last column. So we can see our results better we’ll used select() to move it to be the first column in the dataframe)

  • catch %>%
      mutate(difference = STL - FL) %>%
      select(difference, everything())
    # A tibble: 2,325 × 13
       difference Site      Species Sex   Observed_Stage   PCL    FL   STL Hook_Size
            <dbl> <chr>     <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl>
     1         66 Aransas_… Bagre_… U     <NA>              NA   287   353        10
     2         70 Aransas_… Bagre_… U     <NA>              NA   425   495        10
     3         86 Aransas_… Bagre_… U     <NA>              NA   416   502        15
     4         91 Aransas_… Bagre_… U     <NA>              NA   416   507        10
     5         92 Aransas_… Bagre_… U     <NA>              NA   418   510        15
     6         81 Aransas_… Bagre_… U     <NA>              NA   434   515        10
     7         93 Aransas_… Bagre_… U     <NA>              NA   427   520        15
     8         86 Aransas_… Bagre_… U     <NA>              NA   446   532        10
     9         73 Aransas_… Bagre_… U     <NA>              NA   465   538        10
    10         89 Aransas_… Bagre_… U     <NA>              NA   450   539        10
    # ℹ 2,315 more rows
    # ℹ 4 more variables: Set <dbl>, Day <dbl>, Month <dbl>, Year <dbl>

    You should now have a column called difference at the end of the data frame5.

  • 5 Instead of - to substract, you can other mathematical operators such as + to add , * to multiple, and / to divide values when creating a new column.

  • Give it a whirl

    How would you create a new column called ratio, that is the ratio of the fork to stretch total length?

    Did it!

    [Your answer here]

    You can also create a column that contains a logical value (TRUE/FALSE). For example we might need a column that indicates if the Sex is unknown.

    catch %>%
      mutate(unknown_sex = Sex == "U") %>%
      select(unknown_sex, everything())
    # A tibble: 2,325 × 13
       unknown_sex Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size
       <lgl>       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl>
     1 TRUE        Aransas… Bagre_… U     <NA>              NA   287   353        10
     2 TRUE        Aransas… Bagre_… U     <NA>              NA   425   495        10
     3 TRUE        Aransas… Bagre_… U     <NA>              NA   416   502        15
     4 TRUE        Aransas… Bagre_… U     <NA>              NA   416   507        10
     5 TRUE        Aransas… Bagre_… U     <NA>              NA   418   510        15
     6 TRUE        Aransas… Bagre_… U     <NA>              NA   434   515        10
     7 TRUE        Aransas… Bagre_… U     <NA>              NA   427   520        15
     8 TRUE        Aransas… Bagre_… U     <NA>              NA   446   532        10
     9 TRUE        Aransas… Bagre_… U     <NA>              NA   465   538        10
    10 TRUE        Aransas… Bagre_… U     <NA>              NA   450   539        10
    # ℹ 2,315 more rows
    # ℹ 4 more variables: Set <dbl>, Day <dbl>, Month <dbl>, Year <dbl>

    You should know have a column called unknown_sex where if the animal that was caught was not sexed contains the value TRUE, if it was identified as male or female it would say FALSE.

    Give it a whirl

    How would you create a new column called post_2017 that is TRUE if fish were caught after 2017?

    Did it!

    [Your answer here]

    For that last problem, a “conditional mutate” using an ifelse statement (if this then do that, else do that) could have come in handy. Another option is case_when() which allows you to create multiple sets of conditions as opposed to ifelse which sets up a TRUE/FALSE dichotomy (file this information away for “maybe useful later”).

    8.2 group_by() and mutate()

    Many problems in data science require you to split your data set into subsets according to some grouping variable, apply a function, and then combine the results. dplyr is designed to make this straightforward; you have already sen an example of this while you were learning about filter().

    Similarly, you can combine mutate() with group_by().

    The function mean() will calculate the mean value of a vector of numbers, the argument na.rm=TRUE tells the function to ignore any NA-values in the data set.

    For example, let’s say you wanted to create a column that is the difference between the fork length of an individual and the mean fork length of that species.

    catch %>%
      group_by(Species) %>%
      mutate(diff_mean = FL-mean(FL, na.rm = TRUE))
    # A tibble: 2,325 × 13
    # Groups:   Species [14]
       Site     Species Sex   Observed_Stage   PCL    FL   STL Hook_Size   Set   Day
       <chr>    <chr>   <chr> <chr>          <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
     1 Aransas… Bagre_… U     <NA>              NA   287   353        10     1    28
     2 Aransas… Bagre_… U     <NA>              NA   425   495        10     1    28
     3 Aransas… Bagre_… U     <NA>              NA   416   502        15     1    28
     4 Aransas… Bagre_… U     <NA>              NA   416   507        10     1    28
     5 Aransas… Bagre_… U     <NA>              NA   418   510        15     1    28
     6 Aransas… Bagre_… U     <NA>              NA   434   515        10     1    28
     7 Aransas… Bagre_… U     <NA>              NA   427   520        15     1    28
     8 Aransas… Bagre_… U     <NA>              NA   446   532        10     1    28
     9 Aransas… Bagre_… U     <NA>              NA   465   538        10     1    28
    10 Aransas… Bagre_… U     <NA>              NA   450   539        10     1    28
    # ℹ 2,315 more rows
    # ℹ 3 more variables: Month <dbl>, Year <dbl>, diff_mean <dbl>
    Give it a whirl

    How would you create a new column called that contains the difference between the fork length of an individual and the mean fork length of that species for each month?

    Did it!

    [Your answer here]

    8.3 Create new data.frame based on another

    Not infrequently we are more interested in summary (descriptive) stats of a data set rather than all the raw data - Tidyverse got you covered with the function summarize().

    For example, we might want to calculate the mean and standard deviation of the measured fork length.

    catch %>%
      summarize(mean_FL = mean(FL, na.rm = TRUE),
                sd_FL = sd(FL, na.rm = TRUE))
    # A tibble: 1 × 2
      mean_FL sd_FL
        <dbl> <dbl>
    1    406.  103.

    Remember, that earlier we’ve have used the function max() to obtain the largest value in a vector.

    Give it a whirl

    How could you use summarize to get the maximum forklength?

    Did it!

    [Your answer here]

    That’s cool but really we could have also just used

    mean(catch$FL, na.rm = TRUE)
    [1] 405.9179
    max(catch$FL, na.rm = TRUE)
    [1] 1140

    to get that information, since we are only interested in one column (vector).

    summarize() becomes especially powerful once we leverage group_by() to start calculating summary stats for entries grouped by a grouping variable.

    For example we can calculate summary stats by species and generate a table to include in a report.

    catch %>%
      group_by(Species) %>%
      summarize(mean_FL = mean(FL, na.rm = TRUE),
               median_FL = median(FL, na.rm = TRUE),
               max_FL = max(FL, na.rm = TRUE),
               min_FL = min(FL, na.rm = TRUE),
               sd_FL = sd(FL, na.rm = TRUE)) %>%
      ungroup()
    # A tibble: 14 × 6
       Species                    mean_FL median_FL max_FL min_FL sd_FL
       <chr>                        <dbl>     <dbl>  <dbl>  <dbl> <dbl>
     1 Bagre_marinus                 433.      445     575     45  65.4
     2 Carcharhinus_brevipinna       644.      648     900    489  69.2
     3 Carcharhinus_leucas           769       702    1140    624 167. 
     4 Carcharhinus_limbatus         613.      579     757    538 101. 
     5 Carcharhinus_porosus          415       415     415    415  NA  
     6 Hypanus_americanus            NaN        NA    -Inf    Inf  NA  
     7 Hypanus_sabina                NaN        NA    -Inf    Inf  NA  
     8 Rhinoptera_bonasus            NaN        NA    -Inf    Inf  NA  
     9 Rhizoprionodon_terraenovae    412       396     637    306  73.6
    10 Sciades_felis                 299.      297     480    102  41.9
    11 Sciaenops_ocellatus           793       793     841    745  67.9
    12 Sphyrna_lewini                471.      548.    578    210 174. 
    13 Sphyrna_tiburo                622.      605     861    370 114. 
    14 Synodus_foetens               173       173     173    173  NA  
    Consider this

    If you look closely you should see that you are getting a few NA, NaN, -Inf, and Inf values - any guesses why? You might want to pull up the catch data frame in the view panel to see what is going on with those species.

    Give it a whirl

    How could you use summarize() to calculate a range of summary stats for the stretch total length for individuals grouped by sex?

    Did it!

    [Your answer here]

    So far, we have been manipulating our data frame using code and printing it directly to the console (and our quarto document). This can be useful for example to generate tables for reports. However, in many cases we want to create a new object that has been manipulated according to our code and then we will further process, visualize, or analyze that dataframe down the line.

    summary <- catch %>%
      group_by(Species) %>%
      summarize(mean_FL = mean(FL, na.rm = TRUE),
               median_FL = median(FL, na.rm = TRUE),
               max_FL = max(FL, na.rm = TRUE),
               min_FL = min(FL, na.rm = TRUE),
               sd_FL = sd(FL, na.rm = TRUE)) %>%
      ungroup()

    When you execute this code, you’ll notice that the code (and probably a warning) is printed to the console but there is no output. Instead, if you look at the environment in the bottom left panel you should now see a new object called summary. Per usual, you can pull that up in the Editor/View pane (top left) using either View(summary) in the console or by clicking on the object in the environment.

    You will be presenting results in reports over the course of the semester, when you knit an quarto file you will get tables formatted in a standard way according to defaults in the resulting html file. If you want finer control over the output, you can use the kable() function. This will allow you to further format the table, for example, you may specify the number of digits printed using the argument digits =.

    By adding a chunk options for a label as #| label: tbl-sum-stats and table caption as #| tbl-cap: "Summary statistics for the forklength of each species, you can further modify the output that adheres to typical reporting standards for reports and research articles.

    kable(
      summary,
      digits = 1
    )
    Species mean_FL median_FL max_FL min_FL sd_FL
    Bagre_marinus 433.4 445.0 575 45 65.4
    Carcharhinus_brevipinna 643.7 648.0 900 489 69.2
    Carcharhinus_leucas 769.0 702.0 1140 624 167.3
    Carcharhinus_limbatus 613.2 579.0 757 538 101.0
    Carcharhinus_porosus 415.0 415.0 415 415 NA
    Hypanus_americanus NaN NA -Inf Inf NA
    Hypanus_sabina NaN NA -Inf Inf NA
    Rhinoptera_bonasus NaN NA -Inf Inf NA
    Rhizoprionodon_terraenovae 412.0 396.0 637 306 73.6
    Sciades_felis 298.9 297.0 480 102 41.9
    Sciaenops_ocellatus 793.0 793.0 841 745 67.9
    Sphyrna_lewini 470.8 547.5 578 210 174.4
    Sphyrna_tiburo 621.5 605.0 861 370 114.4
    Synodus_foetens 173.0 173.0 173 173 NA
    Table 8.1: Summary statistics for the forklength of each species in the catch data

    8.4 Combining verbs

    We’ve already combined most of our dplyr verbs with group_by().

    When you are wrangling data you will find that making use of the pipe (%>%) to combine select(), filter(), mutate(), and summarize() as a series of commands will be necessary to get your data set in the correct format and further process it.

    Give it a whirl

    Executre the following the code chunk. Then describe what each line is doing to manipulate the data frame.

    catch %>%
      select(-PCL, -Hook_Size) %>%
      separate(Species, into = c("genus", "species"), remove = TRUE) %>%
      unite(Date, Day, Month, Year) %>%
      filter(genus == "Carcharhinus" & Sex %in% c("F", "M")) %>%
      group_by(Site, genus, species, Sex) %>%
      filter(FL == max(FL)) %>%
      arrange(species)
    # A tibble: 11 × 9
    # Groups:   Site, genus, species, Sex [11]
       Site               genus species Sex   Observed_Stage    FL   STL   Set Date 
       <chr>              <chr> <chr>   <chr> <chr>          <dbl> <dbl> <dbl> <chr>
     1 Aransas_Bay        Carc… brevip… M     <NA>             640   792     3 22_9…
     2 Redfish_Bay        Carc… brevip… F     <NA>             900  1090     1 16_6…
     3 Redfish_Bay        Carc… brevip… M     <NA>             882  1092     2 16_6…
     4 Corpus_Christi_Bay Carc… brevip… M     <NA>             699   860     1 25_1…
     5 Corpus_Christi_Bay Carc… brevip… F     <NA>             704   880     4 2_10…
     6 Aransas_Bay        Carc… leucas  F     <NA>            1140  1410     1 25_5…
     7 Corpus_Christi_Bay Carc… leucas  F     YOY              694   854     2 10_6…
     8 Aransas_Bay        Carc… leucas  M     <NA>             812   912     4 22_9…
     9 Redfish_Bay        Carc… leucas  M     <NA>             840  1010     3 29_9…
    10 Aransas_Bay        Carc… limbat… F     <NA>             757   940     2 22_6…
    11 Corpus_Christi_Bay Carc… limbat… M     <NA>             610   770     2 30_8…
    Did it!

    [Your answer here]

    Generate the code that will manipulate the data frame as follows6:

  • 6 some bullet points may require more than one line of code; you do not have to perform the steps in the sequence presented, play around a little bit to see how to code this more efficiently

  • Give it a whirl

    Challenge 1:

    • order columns so Day, Month, Year, Set are at the beginning.
    • retain all male individuals in the genus Carcharhinus.
    • get rid of columns containing information on observed stage, precaudal length, and hook size
    Did it!

    [Your answer here]

    Give it a whirl

    Challenge 2:

    • create a new variable called Set_ID consisting of Day, Month, Year, and Set number.
    • determine the number of individuals per species per set7.
  • 7 There is a function called n() that allows us to count rows fulfilling a specific condition

  • Did it!

    [Your answer here]

    Give it a whirl

    Challenge 3:

    • remove all gafftops
    • calculate mean forklength for each species by sex and month of the year.
    Did it!

    [Your answer here]